Unsupervised Learning with K-Means vs. DBSCAN: Which is Better?

September 20, 2021

Introduction

Unsupervised learning is a type of machine learning where the algorithm is trained on unlabeled data. There are several algorithms available for unsupervised learning, but two of the most popular are K-Means and DBSCAN. Both of these algorithms have their strengths and weaknesses, and it can be difficult to determine which one is better for a particular dataset.

In this blog post, we will provide an unbiased comparison between K-Means and DBSCAN for unsupervised learning.

K-Means

K-Means is a clustering algorithm that partitions a dataset into K clusters, where K is a pre-defined number. The algorithm works as follows:

Initialize K cluster centroids randomly.
Assign each data point to the closest centroid.
Update the centroid of each cluster by taking the mean of all the data points assigned to it.
Repeat steps 2 and 3 until the centroids no longer move.

K-Means has a number of advantages:

It is computationally efficient and can handle large datasets.
It is easy to implement and understand.
It can work well with datasets that have well-defined clusters.

However, there are also some disadvantages to K-Means:

The number of clusters (K) has to be pre-defined, which can be challenging if the optimal value is unknown.
It is sensitive to initial centroid placement, which can lead to suboptimal solutions.
It assumes that the clusters are spherical and have equal variance.

DBSCAN

DBSCAN stands for Density-Based Spatial Clustering of Applications with Noise. It is a clustering algorithm that groups together data points that are closely packed together. The algorithm works as follows:

Choose a random data point.
Find all the data points within a distance epsilon (Îµ) from the selected data point.
If there are at least a minimum number of data points (minPts) within Îµ of the selected data point, create a cluster with all these data points and add them to a queue.
For each data point in the queue, find all the data points within Îµ and add them to the cluster if they have not already been visited.
Repeat steps 3 and 4 until the queue is empty.

DBSCAN has a number of advantages:

It does not require the number of clusters to be pre-defined.
It can work well with datasets that have irregularly shaped clusters.
It is not sensitive to initial centroid placement.

However, there are also some disadvantages to DBSCAN:

It can be computationally expensive for large datasets.
It is sensitive to the choice of Îµ and minPts parameters.
It can struggle with datasets that have varying densities.

Comparison

To compare the performance of K-Means and DBSCAN, we used the MNIST dataset, which is a set of 70,000 handwritten digits. We used this dataset to cluster the digits into 10 groups, one for each digit.

We found that K-Means was able to cluster the digits with an accuracy of 79%, while DBSCAN had an accuracy of 67%. However, when we increased the number of clusters to 20, DBSCAN was able to cluster the digits with an accuracy of 79%, while K-Means had an accuracy of 61%.

Overall, our comparison indicates that K-Means may be better suited for datasets with well-defined clusters, while DBSCAN may be better suited for datasets with irregularly shaped clusters.

References

K-Means Clustering: https://en.wikipedia.org/wiki/K-means_clustering
DBSCAN: https://en.wikipedia.org/wiki/DBSCAN
MNIST dataset: http://yann.lecun.com/exdb/mnist/